Abstract Recently, we developed a coronavirus-specific
gene-finding system, ZCURVE_CoV 1.0. In this paper, the system
is further improved by taking into account the prediction of
cleavage sites of viral proteinases in polyproteins. The cleavage
sites of the 3C-like proteinase and papain-like proteinase are
highly conserved. In addition to a traditional positional weight
matrix trained on the peptides around cleavage sites, the present
method exploits the length conservation of the non-structural
proteins cleaved by the 3C-like proteinase and papain-like
proteinase to reduce the false positive prediction rate. The
improved system, ZCURVE_CoV 2.0, has been run on each of the 24
completely sequenced coronavirus genomes in GenBank. Consequently,
all the non-structural proteins in the 24 genomes are accurately
predicted. Compared with known annotations, the performance of
the present method is satisfactory. The software ZCURVE_CoV 2.0
is freely available at http://tubic.tju.edu.cn/sars/.

© 2003 Published by Elsevier B.V. on behalf of the Federation 
of European Biochemical Societies. 
1. Introduction

Due to the severity of a life-threatening disease, referred to 
as severe acute respiratory syndrome (SARS), the World 
Health Organization (WHO) has issued a global alert for 
the illness. SARS apparently began in Guangdong province 
of China in November 2002, and has spread to Hong Kong, 
Singapore, Vietnam, Canada, the USA and several European 
countries [1-6]. By early June 2003, more than 700 SARS-related
deaths had been recorded by WHO
(http://www.who.int/csr/sars/country/en/).

A novel coronavirus, called SARS-coronavirus or SARS-CoV, has
been shown to be the cause of SARS. The coronaviruses (order
Nidovirales, family Coronaviridae, genus Coronavirus) are members
of a family of large, enveloped, positive-stranded RNA viruses
that replicate in the cytoplasm of animal host cells [7]. There
are three groups of coronaviruses; groups I and II contain
mammalian viruses, while group III contains only avian viruses.
The viruses are associated with a variety of diseases in humans
and domestic animals, including gastroenteritis and diseases of
the upper and lower respiratory tract. Many researchers have
analyzed the phylogeny of SARS-CoV and concluded that it is not
closely related to any of the previously characterized
coronaviruses and forms a distinct group (group IV) within the
genus Coronavirus [7,8]. At the time this paper was written,
complete genome sequences of 12 SARS-CoV strains were available
from GenBank [7-9]. Among these genomes, six have been annotated
manually, and the remaining six have not been annotated yet.
The genomic organization of SARS-CoV is that of a typical
coronavirus, with the characteristic genes in the order
replicase [rep], spike [S], envelope [E], membrane [M],
nucleocapsid [N] from the 5' to the 3' terminus. SARS-CoV also
encodes a number of non-structural proteins with unknown
functions, located between S and E, between M and N, or
downstream of N. We have developed a coronavirus-specific
gene-finding system, ZCURVE_CoV 1.0 [10], which is especially
suitable for gene recognition in SARS-CoV genomes. The software
is simple, reliable, accurate and fast, and can be obtained
freely at http://tubic.tju.edu.cn/sars/. ZCURVE_CoV 1.0 has been
run on each of the 12 SARS-CoV genomes. In addition to the
polyprotein chains ORF1a and ORF1b and the four genes encoding
the major structural proteins S, E, M and N, ZCURVE_CoV 1.0 also
predicts five to six putative proteins of unknown function,
between 39 and 274 amino acids in length, in SARS-CoV genomes.
However, the cleavage sites of viral proteinases in the
replicases are not predicted by ZCURVE_CoV 1.0.

The coronavirus replicases are encoded by two large,
5'-proximal open reading frames (ORFs) that comprise
approximately two-thirds of the genome. Polyproteins ORF1a and
ORF1b are connected by a ribosomal frameshift site, which is
believed to occur at the conserved 'slippery sequence',
UUUAAAC. The frameshift results in the translation of an ORF1a
protein and a carboxyl-extended ORF1ab frameshift protein, which
are also known as replicase polyproteins pp1a and pp1ab [11].
The ORF1a and ORF1ab translation products are polyprotein
precursors, which are cleaved by viral proteinases, resulting in
a minimum of 13 non-structural proteins, including a 3C-like
proteinase, an RNA-dependent RNA polymerase, an ATPase/helicase
and other non-structural proteins of unknown function [11].
These proteins in turn are responsible for replicating the viral
genome as well as generating nested transcripts that are used in
the synthesis of viral proteins. In this paper, all the putative
non-structural proteins resulting from the cleavage by viral
proteinases in the polyproteins are precisely predicted using
ZCURVE_CoV 2.0.

Table 1
The lengths of the 11 non-structural proteins (a) cleaved by the 3C-like proteinase

Genome              Length of non-structural proteins (aa)
                    nsp2   nsp3   nsp4   nsp5   nsp6   nsp7   nsp9   nsp10  nsp11  nsp12  nsp13
TOR2                306    290    83     198    113    139    932    601    527    346    298
HCoV-229E (b)       302    279    83     195    109    135    927    597    518    348    300
MHV (b)             303    287    92     194    110    137    928    600    521    374    299
BCoV                303    287    89     197    110    137    928    603    521    374    299
IBV (b)             307    293    83     210    111    145    940    600    521    338    302
TGEV                302    294    83     195    111    135    929    599    519    339    300
PEDV                302    280    83     195    108    135    927    597    517    339    301
Average length (c)  304    287    85     198    110    138    930    600    521    351    300
Standard deviation  2.07   5.87   3.76   5.59   1.60   3.60   4.67   2.15   3.26   16.07  1.35

(a) These proteins are cleaved by the 3C-like proteinase within
polyprotein 1ab derived from the seven coronavirus genomes
annotated by NCBI.
(b) The cleavage sites have been confirmed by experimental
evidence in these genomes.
(c) The genomes that have the maximum lengths for nsp2-13 except
nsp8 are IBV, TGEV, MHV, IBV, TOR2, IBV, IBV, BCoV, TOR2, MHV
(BCoV) and IBV, respectively. The genomes that have the minimum
lengths for nsp2-13 except nsp8 are HCoV-229E (TGEV, PEDV),
HCoV-229E, TOR2 (HCoV-229E, IBV, TGEV, PEDV), MHV, PEDV,
HCoV-229E (TGEV, PEDV), HCoV-229E (PEDV), HCoV-229E (PEDV),
PEDV, IBV and TOR2, respectively.

2. Materials and methods 

Seven genomic sequences of coronaviruses and their annotation
information were downloaded from the NCBI RefSeq project. These
coronaviruses are avian infectious bronchitis virus (IBV)
(NC_001451), bovine coronavirus (BCoV) (NC_003045), human
coronavirus 229E (HCoV-229E) (NC_002645), murine hepatitis virus
(MHV) (NC_001846), porcine epidemic diarrhea virus (PEDV)
(NC_003436), SARS coronavirus TOR2 (TOR2) (NC_004718) and
transmissible gastroenteritis virus (TGEV) (NC_002306). These
genomes have been annotated by NCBI and the sequences of the
mature peptides are available. According to the annotation, a
total of 77 sites cleaved by the 3C-like proteinase and 17 sites
cleaved by the papain-like proteinase were extracted from the
seven genomes. Octapeptides cleaved by the 3C-like proteinase and
12-mer peptides cleaved by the papain-like proteinase were used
to train the corresponding positional weight matrices (PWMs)
[12]. The cleavage site is at the center of the octapeptide or
12-mer peptide. The length distribution of the non-structural
proteins within ORF1ab was also derived from the annotated
genomes. At the time this paper was written, there were 24
complete sequences of coronavirus genomes available in the
GenBank database, of which 12 are SARS-CoVs and 12 belong to
other groups of coronaviruses. The former comprise SARS-CoV
TOR2 (NC_004718), Urbani (AY278741), HKU-39849 (AY278491),
CUHK-W1 (AY278554), BJ01 (AY278488), CUHK-Su10 (AY282752),
SIN2500 (AY283794), SIN2748 (AY283797), SIN2679 (AY283796),
SIN2774 (AY283798), SIN2677 (AY283795) and TW1 (AY291451),
whereas the latter comprise IBV (NC_001451), BCoV (NC_003045),
bovine coronavirus strain Mebus (BCoVM) (U00735), bovine
coronavirus isolate BCoV-LUN (BCoVL) (AF391542), bovine
coronavirus strain Quebec (BCoVQ) (AF220295), HCoV-229E
(NC_002645), MHV (NC_001846), murine hepatitis virus strain
ML-10 (MHVM) (AF208067), murine hepatitis virus strain 2 (MHV2)
(AF201929), murine hepatitis virus strain Penn 97-1 (MHVP)
(AF208066), PEDV (NC_003436) and TGEV (NC_002306).
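The PWM training step mentioned above can be illustrated with a minimal sketch. It assumes the training peptides are already aligned on the cleavage site and simply tabulates the relative frequency of each residue at each position; the function name `train_pwm` and the four example octapeptides (read from the TGEV rows of Table 3, with the cleavage bar removed) are chosen here for illustration only and are not the actual training set of 77 sites.

```python
from collections import Counter

def train_pwm(peptides):
    """Return one dict per position mapping residue -> relative frequency.

    Hypothetical helper: every peptide must have the same length and be
    centred on the cleavage site (P4..P1 followed by P1'..P4' for octapeptides).
    """
    width = len(peptides[0])
    if any(len(p) != width for p in peptides):
        raise ValueError("all training peptides must have the same length")
    pwm = []
    for pos in range(width):
        counts = Counter(p[pos] for p in peptides)
        pwm.append({aa: n / len(peptides) for aa, n in counts.items()})
    return pwm

# Four illustrative 3C-like octapeptides (P4 P3 P2 P1 P1' P2' P3' P4').
training = ["STLQSGLR", "VNLQAGKV", "STVQSKLT", "TILQSVAS"]
pwm = train_pwm(training)
print(pwm[3])  # frequencies at P1
print(pwm[2])  # frequencies at P2
```

With this toy set, Gln is the only residue observed at P1 and Leu dominates P2, which mirrors the conservation pattern described in the text.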

The mature peptides cleaved by the 3C-like proteinase are highly
conserved in length among different groups of coronaviruses,
while those cleaved by the papain-like proteinase are less
conserved. The lengths of all the non-structural proteins cleaved
by the 3C-like proteinase within polyprotein 1ab are listed in
Table 1, while the lengths of the non-structural proteins cleaved
by the papain-like proteinase are listed in Table 2. The average
length and standard deviation are calculated for each
non-structural protein. As shown in Tables 1 and 2, the lengths
of the non-structural proteins cleaved by the 3C-like proteinase
are highly conserved, while the lengths and the number of the
papain-like cysteine proteinase cleavage products (abbreviated as
PCP CP) appear to be irregular. Since the NCBI annotations are
not always correct, the annotations of the cleavage products of
the papain-like proteinase may be incomplete. It is observed that
the size of the annotated PCP CP3 of SARS-CoV, MHV and IBV is
approximately the sum of the sizes of PCP CP3 and PCP CP4 of the
other mammalian coronaviruses listed in Table 2. Therefore, the
PCP CP3 of SARS-CoV, MHV and IBV may be further cleaved, i.e. it
is possible that another papain-like proteinase cleavage site is
present in the PCP CP3 of SARS-CoV, MHV and IBV. Based on the
above analysis, a cleavage model of the papain-like proteinase is
presented schematically in Fig. 1. According to this model, all
coronaviruses have four non-structural proteins cleaved by the
papain-like proteinase. Consequently, the cleavage products of
the papain-like proteinase predicted by this model are conserved
in both length and number. The average length and standard
deviation for each papain-like proteinase cleavage product are
estimated from the genomes of BCoV, HCoV-229E, TGEV and PEDV, in
which four papain-like proteinase cleavage products are annotated
(see Table 2). Fig. 2A,B shows the conservation of the sites
cleaved by the 3C-like proteinase and papain-like proteinase,
respectively. It can be seen that both the 3C-like proteinase and
the papain-like proteinase have conserved cleavage sites. The
identical order of the cleavage products in polyprotein 1ab, the
similar sizes of the non-structural proteins and the conserved
residues in the cleavable peptides form the basis of the present
algorithm for predicting cleavage sites in polyproteins. The
method is briefly described as follows.
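As a small worked example of the length statistics, the sketch below recomputes the average and standard deviation of nsp6 from the seven lengths in Table 1 and turns them into a scanning interval. The sample standard deviation (n - 1 in the denominator) reproduces the tabulated value. The function name `scan_window`, the convention that a site is given as the index of the residue immediately upstream of the cleaved bond, and the example position 4232 (the last residue of nsp6 in BCoVL according to Table 4) are assumptions made for illustration.

```python
from statistics import mean, stdev

# nsp6 lengths (aa) for TOR2, HCoV-229E, MHV, BCoV, IBV, TGEV and PEDV (Table 1).
nsp6_lengths = [113, 109, 110, 110, 111, 108, 111]

def scan_window(downstream_site, lengths, n_sd=3):
    """Return (first, last) candidate positions for the upstream cleavage site.

    downstream_site: position of the already known cleavage site, given as
    the index of the last residue before the cleaved bond (assumed convention).
    """
    center = downstream_site - round(mean(lengths))
    half_width = round(n_sd * stdev(lengths))  # stdev() is the sample SD
    return center - half_width, center + half_width

avg, sd = mean(nsp6_lengths), stdev(nsp6_lengths)
print(round(avg), round(sd, 2))         # 110 1.6
print(scan_window(4232, nsp6_lengths))  # (4117, 4127)
```

The interval is only eleven residues wide, which is why restricting the search to it removes most spurious high-scoring peptides elsewhere in the polyprotein.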

Table 2
The lengths of the non-structural proteins (a) cleaved by the
papain-like proteinase

Genome                  Length (aa)
                        PCP CP1     PCP CP2     PCP CP3     PCP CP4
IBV                     -           673 (b)     2106        -
TOR2                    179         639         2422        -
MHV                     247 (b)     585 (b)     2501        -
BCoV                    246         605         1899        496
HCoV-229E               111 (b)     786         1587 (b)    481
TGEV                    108         771         1509        490
PEDV                    110         785         1622        480
Average length (c)      144         737         1654        487
Standard deviation (c)  68.18       88.10       169.87      7.63

(a) These proteins are cleaved by the papain-like proteinase
within polyprotein 1ab derived from the seven coronavirus genomes
annotated by NCBI.
(b) These cleavage products have been confirmed by experimental
evidence.
(c) The average length and standard deviation are calculated
based on the genomes of BCoV, HCoV-229E, TGEV and PEDV.

Fig. 1. Schematic comparison between the N-terminal sequences of
the polyprotein 1ab in MHV and BCoV. The additional cleavage site
predicted by the present method in the annotated PCP CP3 of MHV
is situated at the position corresponding to the site where PCP
CP3 and PCP CP4 are cleaved in BCoV. Cleavage sites that have
been annotated by NCBI are indicated by black arrows, while the
cleavage site predicted by the present method is indicated by an
open arrow.

First, ORF1ab and the slippery sequence are identified using
ZCURVE_CoV 1.0. Subsequently, the predicted ORF1ab is translated
into an amino acid sequence. Starting from the C-terminus of the
predicted ORF1ab polyprotein, the candidate cleavage site of
nsp13 is searched for within a particular region using the
sliding-window technique. The distance between the center of the
scanning region and the C-terminus of polyprotein 1ab is set
equal to the average length of nsp13. Denoting the center
position by $c$, a window of octapeptide size slides from
position $c - 3\delta$ to $c + 3\delta$, where $\delta$ is the
standard deviation of the length distribution for nsp13 (see
Table 1). Given an octapeptide within the region,
$S = X_4 X_3 X_2 X_1 X_{1'} X_{2'} X_{3'} X_{4'}$, where $X_i$
($i = 4, 3, 2, 1, 1', 2', 3', 4'$) represents the amino acid at
position $P_i$, the score of the octapeptide is computed as

$$\mathrm{Score}(X_4 X_3 X_2 X_1 X_{1'} X_{2'} X_{3'} X_{4'}) = \prod_{i=4}^{4'} f(i, X_i) \qquad (1)$$

where $f(i, X_i)$ ($i = 4, 3, 2, 1, 1', 2', 3', 4'$) is the
frequency of amino acid $X_i$ occurring at position $P_i$, which
is an element of the corresponding positional weight matrix. The
site with the maximum score is selected as the candidate site.
Consequently, the cleavage site nsp12|13 is determined and nsp13
is found.
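Eq. (1) and the window search can be sketched in a few lines. This is an illustrative reading of the procedure, not the ZCURVE_CoV 2.0 source: the names `score` and `best_site`, the toy matrix and the toy sequence are all hypothetical. A site is represented by the number of residues upstream of the cleaved bond, and residues absent from a PWM column receive the small value 0.001 that the method assigns to zero elements.

```python
from math import prod

FLOOR = 0.001  # value assigned to zero elements of the PWM

def score(octapeptide, pwm, floor=FLOOR):
    """Eq. (1): product over the eight positions of the positional frequencies."""
    return prod(column.get(aa, floor) for column, aa in zip(pwm, octapeptide))

def best_site(protein, pwm, center, delta, half=4):
    """Return the highest-scoring site within [center - 3*delta, center + 3*delta]."""
    lo = max(half, round(center - 3 * delta))
    hi = min(len(protein) - half, round(center + 3 * delta))
    return max(range(lo, hi + 1),
               key=lambda s: score(protein[s - half:s + half], pwm))

# Toy PWM (columns P4, P3, P2, P1, P1', P2', P3', P4'); values are invented.
toy_pwm = [{"T": 0.3, "V": 0.3}, {}, {"L": 0.8}, {"Q": 1.0},
           {"S": 0.5, "A": 0.4}, {}, {}, {}]
protein = "G" * 20 + "TVLQSGFR" + "G" * 20

site = best_site(protein, toy_pwm, center=25, delta=1.35)
print(site, protein[site - 4:site] + "|" + protein[site:site + 4])
```

Because only the maximum inside the window is kept, no global score cutoff has to be calibrated.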

Prediction of the other cleavage sites proceeds recursively.
Once the cleavage site nsp12|13 is determined, the next cleavage
site to be predicted is nsp11|12, then nsp10|11, and so forth
down to nsp1|2. In general, once the site nsp$k$|($k+1$) is
determined, the next target is the site nsp($k-1$)|$k$, where
$k = 12, 11, \ldots, 2$ but $k \neq 8$ (see the explanation
below). For clarity, take $k = 6$ as an example, where the site
nsp6|7 is known. First, the center position and the sliding
window used for identifying the site nsp5|6 need to be
determined. The center position $c$ is situated upstream of the
site nsp6|7, at a distance equal to $\bar{l}_6$, the average
length of nsp6. From Table 1, $\bar{l}_6 = 110$ aa and the
standard deviation $\delta$ of the length distribution for nsp6
is 1.6. A window of octapeptide size thus slides from position
$c - 3\delta \approx c - 5$ to $c + 3\delta \approx c + 5$.
Second, the site with the highest score is predicted to be the
candidate site of nsp5|6. Note that in some cases the scores may
be zero because of the limited number of training samples. To
avoid this, a very small quantity (0.001) is assigned to the zero
elements in the positional weight matrix. Also note that the
nsp7|8 site is cleaved in polyprotein 1a, while the nsp7|9 site
is cleaved in polyprotein 1ab. The cleavage sites nsp7|8 and
nsp7|9 are therefore in fact the same, which is why $k \neq 8$.
Furthermore, the site with the second highest score is also taken
into account, in addition to the site with the maximum score, if
the following two conditions are satisfied: (i) Gln and Leu are
found at the P1 and P2 positions, respectively; (ii) the distance
between the two sites is less than five amino acid residues. This
procedure allows for the prediction of two adjacent cleavage
sites in the scanning window. Consequently, the two alternative
cleavage sites annotated by NCBI are also found in the genomes of
MHV and BCoV. Such cases occur rarely in the genomes studied.
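The recursive walk from the C-terminus can be sketched as a simple loop in which each predicted site becomes the anchor for the next window. The sketch is a simplification under stated assumptions: `predict_sites`, the toy PWM, the toy sequence and the product table are hypothetical, and the special handling of nsp8 and of the second-best site is omitted.

```python
from math import prod

def score(peptide, pwm, floor=0.001):
    # Product of positional frequencies, with 0.001 replacing zero elements.
    return prod(col.get(aa, floor) for col, aa in zip(pwm, peptide))

def predict_sites(protein, pwm, products, half=4):
    """Predict N-terminal cleavage sites, walking from the C-terminus upstream.

    products: (name, average_length, standard_deviation) tuples, ordered from
    the C-terminal product towards the N-terminus.
    Returns {name: site}; a site is the number of residues upstream of the bond.
    """
    sites = {}
    downstream = len(protein)  # the C-terminus is the first known boundary
    for name, avg_len, sd in products:
        center = downstream - avg_len
        lo = max(half, round(center - 3 * sd))
        hi = min(len(protein) - half, round(center + 3 * sd))
        best = max(range(lo, hi + 1),
                   key=lambda s: score(protein[s - half:s + half], pwm))
        sites[name] = best
        downstream = best  # the new site anchors the next window
    return sites

toy_pwm = [{"T": 0.5, "A": 0.5}, {"V": 1.0}, {"L": 1.0}, {"Q": 1.0},
           {"S": 0.5, "A": 0.5}, {"G": 1.0}, {}, {}]
# Toy polyprotein: 20-residue head, a 12-residue product X, a 10-residue product Y.
protein = "G" * 16 + "AVLQ" + "SGKK" + "G" * 4 + "TVLQ" + "AGNA" + "G" * 6
products = [("Y", 10, 1.0), ("X", 11, 1.0)]  # average lengths need not be exact

print(predict_sites(protein, toy_pwm, products))
```

Note that the assumed average length of product X (11) differs from its true length (12); the window of three standard deviations still contains the true site.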

By repeating the above procedure 11 times, all of the mature
peptides cleaved by the 3C-like proteinase are identified one by
one. Then, the papain-like proteinase cleavage products are
searched for within the remaining region of polyprotein 1ab. A
similar recursive procedure is performed to search for the
papain-like proteinase cleavage sites. The scores of 12-mer
peptides are calculated as described above. The center position
and the size of the sliding window used to search for the
papain-like proteinase cleavage sites are determined in a way
similar to that used for the 3C-like proteinase. The sites
associated with the maximum scores in the corresponding scanning
regions are predicted to be cleavage sites. Consequently, three
papain-like proteinase cleavage sites are predicted for each
genome.




Fig. 2. Conservation of the sites cleaved by coronavirus
proteinases. Two separate multiple, gap-free alignments around
the P1|P1' positions of the sites cleaved by the 3C-like
proteinase (A) and papain-like proteinase (B) in the training set
are converted to logo presentations in which the size of an amino
acid is proportional to its conservation at the specific position
and the sampling size. The amino acid conservation is measured in
bits of information plotted on a vertical axis whose upper limit
is determined by the natural diversity of amino acids (20)
expressed as a logarithm to base 2 [16]. Seventy-seven sites
cleaved by the 3C-like proteinase were used to generate the logo
in A, and 17 sites cleaved by the papain-like proteinase were
used to generate the logo in B.
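The vertical scale of such a logo can be made concrete with a short calculation. The sketch below computes the information content of one alignment column as log2(20) minus its Shannon entropy; it is a simplified illustration that omits the small-sample correction used in the logo method of ref. [16], and the function name `information_bits` is hypothetical.

```python
from collections import Counter
from math import log2

def information_bits(column):
    """Information content (bits) of one alignment column of amino acids."""
    n = len(column)
    entropy = -sum((c / n) * log2(c / n) for c in Counter(column).values())
    return log2(20) - entropy

print(round(information_bits("QQQQ"), 3))  # fully conserved column
print(round(information_bits("LLVV"), 3))  # two residues, equally frequent
```

A fully conserved position reaches the upper limit of log2(20), about 4.32 bits, and each halving of certainty costs one bit.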
3. Results and discussion

Replicase polyprotein processing is carried out by two or
three ORF1a-encoded viral proteinases. Coronaviruses encode a
chymotrypsin-like proteinase, the 3C-like proteinase, which is
analogous to the main picornaviral proteinase, 3C proteinase
[11]. As mentioned above, the cleavage sites of the 3C-like
proteinase are highly conserved. As shown in Fig. 2A, the P1
position of the peptide sequence is exclusively occupied by Gln.
Leu is dominant at the P2 position (more than 75%), and Val, Ser,
Thr and Pro are clearly favored at the P4 position. At the P1'
position, small, aliphatic residues (Ser, Ala, Asn, Gly and Cys)
are found, of which Ser accounts for more than 50%. There are no
highly favored residues at the P3, P2', P3' and P4' positions.
The lengths of each of the 11 non-structural proteins cleaved by
the 3C-like proteinase in the annotated genomes are listed in
Table 1. Of these non-structural proteins, nsp2 is the putative
3C-like proteinase; nsp3 contains a hydrophobic domain; nsp7 is
known as a growth-factor-like protein; nsp9 is the putative
RNA-dependent RNA polymerase; nsp10 contains a metal ion-binding
domain and an NTPase/helicase domain. Recently, the mRNA cap-1
methyltransferase function has been assigned to nsp13 [13]. The
functions of the other non-structural proteins are unknown.
Moreover, coronaviruses also encode one (group III) or two
(groups I and II) papain-like proteinases, which are analogous to
the foot and mouth disease virus leader proteinase. SARS-CoV
appears to contain only one papain-like proteinase domain in the
predicted gene product of ORF1a [7]. For the papain-like
proteinase, the cleavage sites are also conserved, but less so
than those of the 3C-like proteinase. Gly and Ala are found at
the P1 position, with Gly accounting for more than 75%. At the P2
and P1' positions, Gly is also the dominant residue, accounting
for more than 45% and 50%, respectively. No residue exceeds 40%
at the other positions. In this study, the similar sizes of the
non-structural proteins and the conserved cleavage sites form the
basis of the present algorithm.

Table 3
Comparison of the predicted results for TGEV and PEDV with those annotated by NCBI (a)

No.  Genome       Location (bp)         Location (aa)    Length  Cleavable      Feature
                  Start      Stop       Start   Stop     (aa)    peptide
1    TGEV         315        638        1       108      108     -              PCP CP1
     NC_002306    639        2951       109     879      771     KIARTG|RGAIYV  PCP CP2
                  2952       7478       880     2388     1509    YNKMGG|GDKTVS  PCP CP3
                  7479       8948       2389    2878     490     VSPKSG|SGFFDV  PCP CP4
                  8949       9854       2879    3180     302     STLQ|SGLR      nsp2
                  9855       10736      3181    3474     294     VNLQ|AGKV      nsp3
                  10737      10985      3475    3557     83      STVQ|SKLT      nsp4
                  10986      11570      3558    3752     195     TILQ|SVAS      nsp5
                  11571      11903      3753    3863     111     TKLQ|NNEI      nsp6
                  11904      12308      3864    3998     135     VRLQ|AGKP      nsp7
                  12309      15094      3999    4927     929     TSMQ|SFTV      nsp9 (b)
                  15095(c)   16891(c)   4928    5526     599     TVLQ|AAGM      nsp10
                  16892(c)   18448(c)   5527    6045     519     IGLQ|AKPE      nsp11
                  18449(c)   19465(c)   6046    6384     339     KALQ|SLEN      nsp12
                  19466(c)   20365(c)   6385    6684     300     PQLQ|SAEW      nsp13
2    PEDV         297        626        1       110      110     -              PCP CP1
     NC_003436    627        2981       111     895      785     FGRRGG|NIVPVD  PCP CP2
                  2982       7847       896     2517     1622    FKKKGG|GDVKFS  PCP CP3
                  7848       9287       2518    2997     480     ANKKGA|GLPSFS  PCP CP4
                  9288       10193      2998    3299     302     STLQ|AGLR      nsp2
                  10194      11033      3300    3579     280     VNLQ|GGYV      nsp3
                  11034      11282      3580    3662     83      SSVQ|SKLT      nsp4
                  11283      11867      3663    3857     195     SMLQ|SVAS      nsp5
                  11868      12191      3858    3965     108     VKLQ|NNEI      nsp6
                  12192      12596      3966    4100     135     VRLQ|AGKQ      nsp7
                  12597      15376      4101    5027     927     SIMQ|STDM      nsp9 (d)
                  15377      17167      5028    5624     597     AVLQ|SAGL      nsp10
                  17168      18718      5625    6141     517     SDLQ|ANEG      nsp11
                  18719      19735      6142    6480     339     NNLQ|GLEN      nsp12
                  19736      20638      6481    6781     301     PQLQ|ASEW      nsp13

(a) Of the 24 coronavirus genomes, the results predicted by
ZCURVE_CoV 2.0 are in complete agreement with those annotated by
NCBI, except for the genomes of TGEV and PEDV. The reasons for
these conflicts are analyzed in this table.
(b) This conflict with the annotation is caused by the
problematic annotation.
(c) The locations are different from the annotation, which is
caused by a questionable additional insertion of an amino acid
residue in nsp9.
(d) This conflict with the annotation is caused by the
non-standard frameshift.

Table 4
The predicted results by the present method for BCoVL and SARS-CoV BJ01

No.  Genome       Location (bp)         Location (aa)    Length  Cleavable          Feature
                  Start      Stop       Start   Stop     (aa)    peptide
1    BCoVL        211        948        1       246      246     -                  PCP CP1
     AF391542     949        2763       247     851      605     IRGYRG|VKPLLY      PCP CP2
                  2764       8460       852     2750     1899    WRVPCA|GRRVTF      PCP CP3
                  8461       9948       2751    3246     496     FSLKGG|AVFSYF      PCP CP4
                  9949       10857      3247    3549     303     SFLQ|SGIV          nsp2
                  10858      11718      3550    3836     287     IKLQ|SKRT          nsp3
                  11719      11985      3837    3925     89      SQFQ|SKLT          nsp4
                  11986      12576      3926    4122     197     TVLQ|ALQS (a)      nsp5
                  12577      12906      4123    4232     110     TVLQ|NNEL          nsp6
                  12907      13317      4233    4369     137     VRLQ|AGTA          nsp7
                  13318      16100      4370    5297     928     TTVQ|SKDT          nsp9
                  16101      17909      5298    5900     603     AVMQ|SVGA          nsp10
                  17910      19472      5901    6421     521     TRVQ|CSTN          nsp11
                  19473      20594      6422    6795     374     TKLQ|SLEN          nsp12
                  20595      21491      6796    7094     299     PRLQ|AASD          nsp13
2    BJ01         246        782        1       179      179     -                  PCP CP1
     AY278488     783        2699       180     818      639     TRELNG|GAVTRY      PCP CP2
                  2700       8465       819     2740     1922    FRLKGG|APIKGV      PCP CP3
                  8466       9965       2741    3240     500     ISLKGG|KIVSTC (b)  PCP CP4
                  9966       10883      3241    3546     306     AVLQ|SGFR          nsp2
                  10884      11753      3547    3836     290     VTFQ|GKFK          nsp3
                  11754      12002      3837    3919     83      ATVQ|SKMS          nsp4
                  12003      12596      3920    4117     198     ATLQ|AIAS          nsp5
                  12597      12935      4118    4230     113     VKLQ|NNEL          nsp6
                  12936      13352      4231    4369     139     VRLQ|AGNA          nsp7
                  13353      16147      4370    5301     932     PLMQ|SADA          nsp9
                  16148      17950      5302    5902     601     TVLQ|AVGA          nsp10
                  17951      19531      5903    6429     527     ATLQ|AENV          nsp11
                  19532      20569      6430    6775     346     TRLQ|SLEN          nsp12
                  20570      21463      6776    7073     298     PKLQ|ASQA          nsp13

(a) The alternative cleavage site predicted by the present method
is at QALQ|SEFV (Gln-3928|Ser-3929).
(b) Compared with the annotation, this cleavage site is predicted
additionally by the present method.

The performance of the algorithm is satisfactory when the
predicted results are compared with known annotations. Although
all the SARS genomes have so far been annotated only by in silico
analysis, some annotations for other coronaviruses, such as IBV,
MHV and HCoV-229E, are supported by experimental evidence [11].
A jack-knife (leave-one-out) test was performed to validate the
prediction results for the cleavage sites of the 3C-like
proteinase. In the jack-knife test, each of the seven genomes
under study is singled out in turn and used as the testing
genome, while the remaining six genomes are used as the training
set. Based on the data derived from the six training genomes, the
cleavage sites of the 3C-like proteinase in the testing genome
are predicted and evaluated. The jack-knife test is completed by
repeating this procedure seven times. The results predicted by
the jack-knife test are found to be as good as those of the
self-consistency test mentioned previously, suggesting that the
prediction results are reliable.
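The structure of the jack-knife test can be sketched generically. The loop below holds out one genome at a time, trains on the rest and compares predicted with annotated sites. Everything apart from the loop is a stand-in: `train` and `predict` are deliberately trivial toy functions (a set of P4-P1 tetrapeptides and an exact-match scan), and the three short sequences are invented, so the sketch shows the evaluation scheme rather than the actual predictor.

```python
def jackknife(genomes, train, predict):
    """Leave-one-out evaluation: {genome name: prediction matches annotation}."""
    outcome = {}
    for held_out, (sequence, annotated) in genomes.items():
        training_set = {n: g for n, g in genomes.items() if n != held_out}
        model = train(training_set)  # the held-out genome is never seen here
        outcome[held_out] = predict(sequence, model) == annotated
    return outcome

def train(training_set):
    # Toy model: the P4-P1 tetrapeptides upstream of every annotated site.
    return {seq[s - 4:s] for seq, sites in training_set.values() for s in sites}

def predict(sequence, motifs):
    # Toy predictor: every position preceded by a known tetrapeptide.
    return [s for s in range(4, len(sequence) + 1) if sequence[s - 4:s] in motifs]

# Invented data: name -> (protein sequence, annotated sites).
genomes = {
    "g1": ("GGTVLQSGGG", [6]),
    "g2": ("GTVLQAGGGG", [5]),
    "g3": ("GGGTVLQSGG", [7]),
}
print(jackknife(genomes, train, predict))
```

Replacing the two toy functions with the PWM training and the window search would give the seven-fold test described in the text.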

The prediction results for TGEV and PEDV, which differ from the
annotations of the NCBI RefSeq project, are listed in Table 3.
The prediction results for the other genomes can be obtained from
the supplementary materials (http://tubic.tju.edu.cn/sars/). The
coronavirus -1 frameshift site [14] is believed to occur at the
'slippery sequence', UUUAAAC. This assumption has been supported
by experimental evidence [15]. However, the annotated frameshift
sites are not always consistent with this pattern, as in the case
of PEDV, whose frameshift site lies upstream of the UUUAAAC
sequence according to the annotation. This may be due to a
questionable annotation. For example, the genomes of MHV and BCoV
were originally annotated by their authors as having a
non-standard frameshift site, but the NCBI re-annotations later
corrected these to standard frameshift sites. In light of this,
we adopt UUUAAAC as the standard slippery sequence.
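Locating candidate slippery sequences is a plain motif search. The sketch below lists every occurrence of UUUAAAC in a genome given either as RNA or as its DNA representation; the function name `find_slippery_sites` and the short test string are hypothetical, and deciding which occurrence is the real frameshift site (for example, by requiring it to lie in the ORF1a/ORF1b overlap) is left out.

```python
def find_slippery_sites(genome, motif="UUUAAAC"):
    """Return 1-based start positions of every occurrence of the motif."""
    rna = genome.upper().replace("T", "U")  # accept DNA or RNA alphabets
    hits = []
    start = rna.find(motif)
    while start != -1:
        hits.append(start + 1)
        start = rna.find(motif, start + 1)
    return hits

print(find_slippery_sites("acgTTTAAACgggUUUAAACg"))  # [4, 14]
```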

With the present method, only a few false positive predictions
occur. The tedious calculations needed to derive a cutoff value
are avoided by restricting the sizes of the scanning regions and
selecting only the site with the maximum score within each
region. The annotated cleavage sites often correspond to the
highest scores measured by the PWM method. However, sites scored
high by the PWM method do not always correspond to cleavage
sites, and vice versa. Restricting the scanning region for each
cleavage site is therefore an efficient way to reduce the false
positive prediction rate. For the prediction of the 3C-like
proteinase cleavage sites, there are only two conflicts between
the predicted results and the annotations, which are marked in
Table 3. The first conflict lies in the locations of the
non-structural proteins downstream of nsp9 in TGEV, and may be
due to a problematic annotation. The length of the amino acid
sequence for ORF1ab (315-20368 bp) should be 6684 aa, instead of
the 6685 aa annotated by NCBI. The questionable additional
insertion of an amino acid residue in nsp9 shifts the downstream
locations. The second conflict is caused by a non-standard
frameshift site in PEDV, which results in a difference of five
amino acid residues between the non-standard and the standard
frameshift site. For this reason, the octapeptide predicted by
the present method is SIMQ|STDM instead of the annotated
SIMQ|STDY.
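The ORF1ab length quoted for TGEV can be checked with a few lines of arithmetic. The sketch assumes (these are assumptions for illustration, not statements taken from the annotation) that the interval 315-20368 includes the stop codon and that the -1 frameshift causes exactly one nucleotide to be read twice; `orf1ab_length_aa` is a hypothetical helper name.

```python
def orf1ab_length_aa(start, stop, frameshift=-1):
    """Length in amino acids of a frameshifted ORF given 1-based bp coordinates."""
    spanned = stop - start + 1         # nucleotides covered, stop codon included
    translated = spanned - frameshift  # a -1 shift re-reads one nucleotide
    codons, remainder = divmod(translated, 3)
    if remainder:
        raise ValueError("coordinates are not consistent with the frameshift")
    return codons - 1                  # the final codon is the stop codon

print(orf1ab_length_aa(315, 20368))    # 6684
```

Under these assumptions the coordinates give 6685 codons including the stop codon, i.e. 6684 residues, in line with the value argued for in the text.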

Using the cleavage model of the papain-like proteinase
presented here, the additional cleavage sites predicted by this
method in the annotated PCP CP3 of SARS-CoV TOR2, MHV and IBV are
ISLKGG|KIVSTC, FSLKGG|AVFSRM and VEKKAG|GIVSGT, respectively.
The predicted cleavable peptides are similar to those annotated
by NCBI; for example, the cleavable peptide FSLKGG|AVFSRM in MHV
differs from the annotated peptide FSLKGG|AVFSYF in BCoV only at
the P5' and P6' positions. A comparison between the N-terminal
sequences of the polyprotein 1ab in MHV and BCoV is shown in
Fig. 1. The additional cleavage site predicted by this method in
the annotated PCP CP3 of MHV is situated at the position
corresponding to the site where PCP CP3 and PCP CP4 are cleaved
in BCoV. Cleavage sites that have been annotated by NCBI are
indicated by black arrows, whereas the site predicted by the
present method is indicated by the open arrow. Therefore, the
annotated PCP CP3 of SARS-CoV TOR2, MHV and IBV may be a
precursor that can be cleaved further.

Based on the present method, the genomes without annotation
have been annotated. To save printing space, only the results
for BCoVL and SARS-CoV BJ01 are summarized in Table 4. The
detailed annotations for the other coronavirus genomes are
accessible at http://tubic.tju.edu.cn/sars/.

4. Conclusion 

SARS is an extremely severe disease, which has spread to many
countries around the world. Evidence shows that SARS is caused
by a new coronavirus, i.e. SARS-CoV. A system called ZCURVE_CoV
1.0 was developed previously to recognize protein-coding genes in
coronavirus genomes, and is especially suitable for SARS-CoV
genomes [10]. Here, an improved version of the system, ZCURVE_CoV
2.0, has been developed to identify all the non-structural
proteins cleaved by viral proteinases in the polyproteins.
Consequently, all the non-structural proteins in the 24
completely sequenced coronavirus genomes are predicted. Compared
with the known annotations, including those based on experimental
evidence, the performance of the present method is satisfactory.


Acknowledgements: We are indebted to Prof. Jingchu Luo of Peking
University for providing timely updated SARS-related information.
We are also grateful to both referees for their constructive
comments, which were very useful in improving the quality of the
paper. Invaluable assistance from Ren Zhang is gratefully
acknowledged. The present study was supported in part by the 973
Project of China (Grant 1999075606).

References 

[1] Peiris, J.S. et al. (2003) Lancet 361, 1319-1325. 

[2] Ksiazek, T.G. et al. (2003) New Engl. J. Med. 348, 1953-1966. 

[3] Drosten, C. et al. (2003) New Engl. J. Med. 348, 1967-1976. 

[4] Tsang, K.W. et al. (2003) New Engl. J. Med. 348, 1977-1985. 

[5] Lee, N. et al. (2003) New Engl. J. Med. 348, 1986-1994.

[6] Poutanen, S.M. et al. (2003) New Engl. J. Med. 348, 1995-2005.

[7] Rota, P.A. et al. (2003) Science 300, 1394-1398.

[8] Marra, M.A. et al. (2003) Science 300, 1399-1404.

[9] Qin, E’d. et al. (2003) Chin. Sci. Bull. 48, 941-948.

[10] Chen, L.L., Ou, H.Y., Zhang, R. and Zhang, C.-T. (2003)
Biochem. Biophys. Res. Commun. 307, 382-388.

[11] Ziebuhr, J., Snijder, E.J. and Gorbalenya, A.E. (2000) J. Gen.
Virol. 81, 853-879.

[12] von Heijne, G. (1986) Nucleic Acids Res. 14, 4683-4690.

[13] von Grotthuss, M., Wyrwicz, L.S. and Rychlewski, L. (2003) Cell
113, 701-702.

[14] Brierley, I., Jenner, A.J. and Inglis, S.C. (1992) J. Mol. Biol. 227,
463-479.

[15] Nam, S.H., Copeland, T.D., Hatanaka, M. and Oroszlan, S.
(1993) J. Virol. 67, 196-203.

[16] Schneider, T.D. and Stephens, R.M. (1990) Nucleic Acids Res. 
18, 6097-6100. 


